
    Extending Compositional Attention Networks for Social Reasoning in Videos

    We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of the input modalities (visual, auditory, text) over multiple reasoning steps, using a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively exploit multimodal input cues through mid-level fusion. We apply MAC-X to the task of Social Video Question Answering on the Social IQ dataset and obtain a 2.5% absolute improvement in binary accuracy over the current state of the art.
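    A minimal PyTorch sketch of the kind of reasoning cell described above, assuming a GRU-based control/memory update and per-modality temporal attention; the class and layer names (MACXCell, fuse, etc.) are illustrative and not the authors' implementation.

```python
# Illustrative sketch of a MAC-X-style reasoning step (assumed structure, not the
# paper's code): each modality sequence is summarized by temporal attention
# conditioned on a control state, and the summaries are fused into a recurrent memory.
import torch
import torch.nn as nn

class MACXCell(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.control = nn.GRUCell(dim, dim)                  # control state updated each reasoning step
        self.attn = nn.ModuleDict({m: nn.Linear(dim, 1)
                                   for m in ("visual", "audio", "text")})
        self.fuse = nn.Linear(3 * dim, dim)                  # mid-level fusion of the three summaries
        self.memory = nn.GRUCell(dim, dim)                   # memory carried across steps

    def forward(self, question, control, memory, modalities):
        # modalities: dict name -> (batch, time, dim) sequences, e.g. LSTM outputs
        control = self.control(question, control)
        summaries = []
        for name, feats in modalities.items():
            scores = self.attn[name](feats * control.unsqueeze(1))   # (batch, time, 1)
            weights = torch.softmax(scores, dim=1)                   # temporal attention
            summaries.append((weights * feats).sum(dim=1))
        fused = torch.relu(self.fuse(torch.cat(summaries, dim=-1)))
        return control, self.memory(fused, memory)

# Toy usage: four reasoning steps over 20-frame modality sequences.
cell = MACXCell(dim=128)
q, c, m = torch.randn(2, 128), torch.zeros(2, 128), torch.zeros(2, 128)
mods = {k: torch.randn(2, 20, 128) for k in ("visual", "audio", "text")}
for _ in range(4):
    c, m = cell(q, c, m, mods)
```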

    Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

    The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During graph construction, we allow the modeling of common speech disfluencies, i.e., repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the Oracle Error Rate, our method can be used effectively in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over our baselines.
    Comment: Interspeech 202
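    As a rough illustration of the kind of graph modification described (not the paper's actual construction), the sketch below builds a linear phoneme acceptor whose arcs also permit penalized omissions and repetitions; in practice such a graph would be composed with a CTC topology, and the costs shown are placeholders.

```python
# Sketch of a disfluency-tolerant alignment graph (assumed structure, not the
# authors' code). State i sits before phonemes[i]; besides the canonical arc,
# each phoneme may be skipped (omission) or revisited (repetition) at a cost.
def build_disfluency_graph(phonemes, omit_cost=2.0, repeat_cost=1.5):
    """Return WFST-style arcs as (src_state, dst_state, label, cost)."""
    arcs = []
    for i, ph in enumerate(phonemes):
        arcs.append((i, i + 1, ph, 0.0))               # canonical path: emit the phoneme
        arcs.append((i, i + 1, "<eps>", omit_cost))    # omission: skip the phoneme
        arcs.append((i + 1, i, "<eps>", repeat_cost))  # repetition: go back and emit it again
    final_state = len(phonemes)
    return arcs, final_state

# e.g. for "the cat" in ARPAbet-like phonemes:
arcs, final_state = build_disfluency_graph(["dh", "ah", "k", "ae", "t"])
```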

    Investigating Personalization Methods in Text to Music Generation

    In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusion models with two established personalization methods. We experiment with the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset of prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.
    Comment: Submitted to ICASSP 2024, examples at https://zelaki.github.io
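    The two personalization methods are not named in the abstract; assuming one of them follows a Textual-Inversion-style recipe from vision, the runnable toy below shows the core idea: a new pseudo-token embedding is optimized against a stand-in denoising objective while the diffusion model and the rest of the text encoder stay frozen. All module and variable names are illustrative.

```python
# Toy Textual-Inversion-style loop (an assumption, not the authors' code): only the
# new token's embedding row receives gradient updates; the denoiser stays frozen.
import torch
import torch.nn as nn

dim, vocab_size = 64, 1000
text_embeddings = nn.Embedding(vocab_size + 1, dim)       # extra row for the new pseudo-token
new_token_id = vocab_size
optimizer = torch.optim.Adam([text_embeddings.weight], lr=5e-3)

denoiser = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
for p in denoiser.parameters():
    p.requires_grad_(False)                               # frozen pre-trained diffusion model (stand-in)

clips = torch.randn(8, dim)                               # stand-in latents of the few reference clips
for step in range(100):
    noise = torch.randn_like(clips)
    noisy = clips + noise                                 # toy forward-diffusion step
    ids = torch.full((8,), new_token_id, dtype=torch.long)
    cond = text_embeddings(ids)                           # condition on the learned pseudo-token
    pred = denoiser(torch.cat([noisy, cond], dim=-1))     # predict the added noise
    loss = ((pred - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    text_embeddings.weight.grad[:new_token_id].zero_()    # update only the new token's row
    optimizer.step()
```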

    The relationship between musical training and the processing of audiovisual correspondences: Evidence from a reaction time task

    Numerous studies have reported both cortical and functional changes in visual, tactile, and auditory brain areas in musicians, which have been attributed to long-term training-induced neuroplasticity. Previous investigations have reported advantages for musicians in multisensory processing at the behavioural level; however, multisensory integration in tasks requiring higher-level cognitive processing has not yet been extensively studied. Here, we investigated the association between musical expertise and the processing of audiovisual crossmodal correspondences in a decision reaction-time task. The visual display varied in three dimensions (elevation, symbolic and non-symbolic magnitude), while the auditory stimulus varied in pitch. Congruency was based on a set of newly learned abstract rules: “the higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”. Accuracy and reaction times were recorded. Musicians were significantly more accurate in their responses than non-musicians, suggesting an association between long-term musical training and audiovisual integration. Contrary to what was hypothesized, no differences in reaction times were found. The musicians’ accuracy advantage was also observed for rule-based congruency in seemingly unrelated stimuli (pitch-magnitude pairs). These results suggest an interaction between implicit and explicit processing, as reflected in reaction times and accuracy, respectively, and indicate that the advantage generalises to processes requiring higher-order cognitive functions. Overall, the results support the notion that accuracy and latency measures may reflect different processes.